np problem
OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems
Li, Xiaozhe, Chen, Jixuan, Fang, Xinyu, Ding, Shengyuan, Duan, Haodong, Liu, Qingwen, Chen, Kai
Large Language Models (LLMs) have shown remarkable capabilities in solving diverse tasks. However, their proficiency in iteratively optimizing complex solutions through learning from previous feedback remains insufficiently explored. To bridge this gap, we present OPT-BENCH, a comprehensive benchmark designed to evaluate LLM agents on large-scale search space optimization problems. OPT-BENCH includes 20 real-world machine learning tasks sourced from Kaggle and 10 classical NP problems, offering a diverse and challenging environment for assessing LLM agents on iterative reasoning and solution refinement. To enable rigorous evaluation, we introduce OPT-Agent, an end-to-end optimization framework that emulates human reasoning when tackling complex problems by generating, validating, and iteratively improving solutions through leveraging historical feedback. Through extensive experiments on 9 state-of-the-art LLMs from 6 model families, we analyze the effects of optimization iterations, temperature settings, and model architectures on solution quality and convergence. Our results demonstrate that incorporating historical context significantly enhances optimization performance across both ML and NP tasks. All datasets, code, and evaluation tools are open-sourced to promote further research in advancing LLM-driven optimization and iterative reasoning. Project page: https://github.com/OliverLeeXZ/OPT-BENCH.
Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs
Yang, Chang, Wang, Ruiyu, Jiang, Junzhe, Jiang, Qi, Zhang, Qinggang, Deng, Yanchen, Li, Shuxin, Hu, Shuyue, Li, Bo, Pokorny, Florian T., Huang, Xiao, Wang, Xinrun
Reasoning is the fundamental capability of large language models (LLMs). Due to the rapid progress of LLMs, current benchmarks face two main issues: i) they can be crushed in a short time (less than 1 year), and ii) they may be easily hacked. To handle these issues, we propose ever-scalingness as a principle for building benchmarks that are uncrushable, unhackable, auto-verifiable and general. This paper presents the Nondeterministic Polynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmark for LLMs. Specifically, NPPC has three main modules: i) npgym, which provides a unified interface to 25 well-known NP-complete problems and can generate any number of instances at any level of complexity, ii) npsolver, which provides a unified interface to evaluate the problem instances with both online and offline models via APIs and local deployments, respectively, and iii) npeval, which provides comprehensive and ready-to-use tools to analyze the performance of LLMs across different problems, the number of tokens, the aha moments, the reasoning errors and the solution errors. Extensive experiments over widely used LLMs demonstrate that: i) NPPC can successfully reduce the performance of advanced LLMs to below 10%, demonstrating that NPPC is uncrushable, ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are the most powerful LLMs, where DeepSeek-R1 outperforms Claude-3.7-Sonnet and o1/o3-mini on most of the NP-complete problems considered, and iii) the numbers of tokens and aha moments in advanced LLMs, e.g., Claude-3.7-Sonnet and DeepSeek-R1, are observed first to increase and then to decrease as the problem instances become more difficult. We believe that NPPC is the first ever-scaling reasoning benchmark, serving as an uncrushable and unhackable testbed for LLMs toward artificial general intelligence (AGI).
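The "auto-verifiable" property mentioned in the abstract rests on a general fact about NP problems: a proposed solution can be checked in polynomial time even when finding one is hard. The following is a minimal sketch of such a check for Boolean satisfiability, a standard NP-complete problem. It is not taken from NPPC; the function name `satisfies`, the clause encoding, and the example formula are assumptions made for illustration only.

```python
from typing import Dict, List

# Assumed encoding (DIMACS-style): literal 3 means x3, literal -3 means NOT x3.
Clause = List[int]


def satisfies(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    """Return True if every clause contains at least one true literal.

    Hypothetical helper for illustration; variables missing from
    `assignment` are treated as False.
    """
    return all(
        any(assignment.get(abs(lit), False) == (lit > 0) for lit in clause)
        for clause in clauses
    )


# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
formula = [[1, -2], [2, 3], [-1, -3]]

good = satisfies(formula, {1: True, 2: True, 3: False})
bad = satisfies(formula, {1: True, 2: False, 3: True})
print(good, bad)
```

The check visits each literal once, so its cost grows linearly with the size of the formula, whereas no polynomial-time method is known for finding a satisfying assignment in general.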
Quantum Algorithms Conquer a New Kind of Problem
In 1994, a mathematician figured out how to make a quantum computer do something that no ordinary classical computer could. The work revealed that, in principle, a machine based on the rules of quantum mechanics could efficiently break a large number into its prime factors -- a task so difficult for a classical computer that it forms the basis for much of today's internet security. A surge of optimism followed. Perhaps, researchers thought, we'll be able to invent quantum algorithms that can solve a huge range of different problems. "It's been a bit of a bummer trajectory," said Ryan O'Donnell of Carnegie Mellon University.
- North America > United States > Texas > Travis County > Austin (0.05)
- North America > United States > New Jersey (0.05)
- North America > United States > Massachusetts (0.05)
What is Complexity?
Sometimes we hear people talk about how complex or difficult it is for a machine to solve a problem, or to perform certain tasks that, in humans, involve thinking. But what exactly is the complexity of a problem? Humans compare and evaluate problems and situations all the time. For example, given two situations s1 and s2, we can say that s1 is more complex than s2 because we have some information about them. But is it possible to define complexity in terms a machine could understand? And if so, how can we measure how complex a problem is? Perhaps most of you have already heard of big O notation.
- North America > United States > Illinois > Cook County > Chicago (0.24)
- Europe > Russia (0.14)
- Asia > Russia (0.14)
- Information Technology (1.00)
- Education (0.93)
- Health & Medicine (0.93)
Neyman-Pearson Multi-class Classification via Cost-sensitive Learning
Most existing classification methods aim to minimize the overall misclassification error rate; however, in applications, different types of errors can have different consequences. To take this asymmetry into account, two popular paradigms have been developed, namely the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Compared to the CS paradigm, the NP paradigm does not require a specification of costs. Most previous works on the NP paradigm focused on the binary case. In this work, we study the multi-class NP problem by connecting it to the CS problem, and propose two algorithms. We extend the NP oracle inequalities and consistency from the binary case to the multi-class case, and show that our two algorithms enjoy these properties under certain conditions. The simulation and real data studies demonstrate the effectiveness of our algorithms. To our knowledge, this is the first work to solve the multi-class NP problem via cost-sensitive learning techniques with theoretical guarantees. The proposed algorithms are implemented in the R package "npcs" on CRAN.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
- Africa > South Africa (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
What Question Will You Be Remembered For? - Facts So Romantic
John Brockman has run out of questions, and it's a shame. For 20 years, as a sort of homage to his late friend, the conceptual artist James Lee Byars, who in 1968 started "The World Question Center," Brockman has been posing an "Annual Question" to some of the sharpest minds in the world, many of them scientists. Reviewing what might be a representative sample--"What is the most important invention in the past 2,000 years?", "What do you believe is true even though you cannot prove it?", Which is fitting, given the motto of Brockman's website, Edge.org, to which the responses are posted: "To arrive at the edge of the world's knowledge…" Last week, Brockman announced this year's "Annual Question" to be the last, and it has an appropriately culminating feel to it: "What is the last question?" By "the last question," he means "...your last question, the question for which you will be remembered."
Sudoku Science
Millions of people around the world are tackling one of the hardest problems in computer science--without even knowing it. The logic game Sudoku is a miniature version of a longstanding mathematical challenge, and it entices both puzzlers, who see it as an enjoyable plaything, and researchers, who see it as a laboratory for algorithm design. Sudoku has become a worldwide puzzle craze within the past year. Previously known primarily in Japan, it now graces newspapers, Web sites, and best-selling books in dozens of countries. A puzzle consists of a 9-by-9 grid made up of nine 3-by-3 subgrids.
- Asia > Japan (0.25)
- North America > United States > California > Los Angeles County > Los Angeles (0.06)
- North America > United States > New York > Tompkins County > Ithaca (0.05)
- Europe > United Kingdom (0.05)
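The Sudoku excerpt above describes a 9-by-9 grid made up of nine 3-by-3 subgrids. As a rough illustration of why the puzzle interests algorithm researchers, the sketch below checks a completed grid against the standard rule that every row, column, and subgrid contains the digits 1 to 9 exactly once. That rule is not stated in the excerpt and is assumed here, as are the function name `is_valid_solution` and the example grid.

```python
from typing import List

Grid = List[List[int]]


def is_valid_solution(grid: Grid) -> bool:
    """Check a filled 9x9 grid against the standard Sudoku rule (assumed)."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    # One set per 3x3 subgrid, scanning subgrids left to right, top to bottom.
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    return all(unit == digits for unit in rows + cols + boxes)


# A valid grid built from a shifted pattern (illustrative, not from the article).
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Copy the grid and duplicate one digit within the first row to break it.
broken = [row[:] for row in solved]
broken[0][0] = broken[0][1]

print(is_valid_solution(solved), is_valid_solution(broken))
```

Checking a finished grid takes a fixed number of passes over its 81 cells. Filling in an empty or partially filled grid is the hard direction, which is what connects the puzzle to the larger questions about search and verification.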
What Are the Limits of Conventional Computing?
At first glance, the ultimate limit of computation seems to be an engineering issue. How much energy can you put in a chip without melting it? How fast can you flip a bit in your silicon memory? How big can you make your computer and still fit it in a room? These questions don't seem terribly profound.